fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs #4683

Neelabh94 · 2025-09-22T11:24:50Z

Rationale for the Change
The original implementation used TensorFlow jobs to test the GKE Hyperdisk volumes. However, the test has been failing continuously due to changes in dependency module versions (e.g., transformers and datasets).
Pinning dependency versions was deemed an inadequate fix, as the test should not rely on a specific version of a non-core component like the TensorFlow example.

Implementation Details
Following an offline discussion, the approach has been changed to replace the problematic TensorFlow jobs with FIO (Flexible I/O Tester) jobs. FIO is a standard benchmarking and stress tool used to measure I/O performance.

The change in examples/gke-managed-hyperdisk.yaml replaces the three separate TensorFlow jobs (for Hyperdisk Balanced, Extreme, and Throughput) with a single FIO-based benchmark job.
The new FIO job is configured to perform I/O tests against the Hyperdisk volumes, including a randrw test for Balanced, a randwrite test for Extreme, and a write test for Throughput.
The deployment now uses a simpler ubuntu:latest image, installs FIO, and executes the benchmarks.
The validation logic in tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-gke-managed-hyperdisk.yml has been updated to specifically run the single FIO job, wait for its completion, fetch the pod logs, and print the FIO test results.
This approach ensures the test focuses squarely on the I/O performance and mounting of the Hyperdisk volumes rather than being dependent on external machine learning libraries and their changing APIs.

Submission Checklist

NOTE: Community submissions can take up to 2 weeks to be reviewed.

Please take the following actions before submitting this pull request.

Fork your PR branch from the Toolkit "develop" branch (not main)
Test all changes with pre-commit in a local branch #
Confirm that "make tests" passes all tests
Add or modify unit tests to cover code changes
Ensure that unit test coverage remains above 80%
Update all applicable documentation
Follow Cluster Toolkit Contribution guidelines #

bytetwin

Can you point to the RCA on how pinning the dependency version succeeds the test. The test should not be dependent on a specific version of transformer as its not the core part of the blueprint to test.
Moreover cloud docs for running tensorflow examples are not pinning the dependencies.

Neelabh94 · 2025-10-07T03:36:50Z

Can you point to the RCA on how pinning the dependency version succeeds the test. The test should not be dependent on a specific version of transformer as its not the core part of the blueprint to test. Moreover cloud docs for running tensorflow examples are not pinning the dependencies.

The test uses these python libraries in a job it executes. The job (a python code) breaks due to change in the internal behavior of these libraries. Pinning them ensure the code behaves in a consistent manner.
I am not fully aware about the update strategy for tensorflow cloud docs but I do believe any breaking change must warrant a change to their doc(s) / example(s) too.

bytetwin

Failures in test should not be propagated as changes to blueprints. The test needs to be fixed to accomodate to this breaking change. Customers when using these models for their workloads will not be using the same version that is pinned here and will find discrepancies

Neelabh94 · 2025-10-08T08:23:02Z

Failures in test should not be propagated as changes to blueprints. The test needs to be fixed to accomodate to this breaking change. Customers when using these models for their workloads will not be using the same version that is pinned here and will find discrepancies

As per our discussion offline, will work on replacing the tensorflow jobs with FIO jobs.

Discussed offline

…ample blueprint Replaces the tensorflow jobs with FIO jobs in the `gke-managed-hyperdisk` example and accordingly updates its correspoding test playbook.

Neelabh94 · 2025-10-28T04:12:43Z

/gemini review

gemini-code-assist

Code Review

This pull request is a great improvement, replacing the flaky TensorFlow tests with more stable and focused FIO benchmarks. The implementation looks solid. I have a few suggestions to improve security, reproducibility, and robustness of the test scripts.

In gke-managed-hyperdisk.yaml, I've pointed out a security concern with running the container as root and suggested pinning the ubuntu image version for reproducibility. I also recommended adding a trap to the bash script to ensure cleanup happens even if a benchmark fails.

In test-gke-managed-hyperdisk.yml, I've suggested a more robust way to wait for the job completion and a cleanup of the logging tasks to avoid redundancy.

examples/gke-managed-hyperdisk.yaml

tools/cloud-build/daily-tests/ansible_playbooks/test-validation/test-gke-managed-hyperdisk.yml

examples/gke-managed-hyperdisk.yaml

Neelabh94 added the test-enhancement Tests enhancement or coverage improvement label Sep 22, 2025

Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch from dd19029 to 11b9cc6 Compare September 24, 2025 05:29

Neelabh94 self-assigned this Sep 24, 2025

Neelabh94 changed the title ~~fix(gke): add compute.admin role to GKE node pool service account~~ fix(gke-hyperdisk-test): Pin the dependency versions Sep 24, 2025

Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch from 5eca840 to b6a98ca Compare September 24, 2025 06:23

Neelabh94 marked this pull request as ready for review September 24, 2025 09:47

Neelabh94 requested review from a team and samskillman as code owners September 24, 2025 09:47

Neelabh94 added the release-chore To not include into release notes label Sep 24, 2025

bytetwin requested changes Sep 29, 2025

View reviewed changes

bytetwin previously requested changes Oct 7, 2025

View reviewed changes

Neelabh94 added do-not-merge Block merging of this PR and removed test-enhancement Tests enhancement or coverage improvement labels Oct 8, 2025

Neelabh94 marked this pull request as draft October 8, 2025 08:27

Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 3 times, most recently from 5983077 to 60e5181 Compare October 16, 2025 07:31

Neelabh94 changed the title ~~fix(gke-hyperdisk-test): Pin the dependency versions~~ fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs Oct 16, 2025

Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 3 times, most recently from 61351ab to ed694da Compare October 16, 2025 15:50

Neelabh94 removed the do-not-merge Block merging of this PR label Oct 16, 2025

Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 2 times, most recently from 413f662 to 0daf86c Compare October 16, 2025 16:34

Neelabh94 marked this pull request as ready for review October 16, 2025 17:06

Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 2 times, most recently from ba47760 to e40b125 Compare October 17, 2025 02:47

Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch 2 times, most recently from 5c579e7 to e892a94 Compare October 23, 2025 02:43

Neelabh94 marked this pull request as draft October 24, 2025 11:10

fix(gke-hyperdisk): Update gke-managed-hyperdisk test playbook and ex…

2f7f378

…ample blueprint Replaces the tensorflow jobs with FIO jobs in the `gke-managed-hyperdisk` example and accordingly updates its correspoding test playbook.

Neelabh94 force-pushed the bugfix/gke-hyperdisk-test-failure branch from 0ee1ef2 to 2f7f378 Compare October 24, 2025 12:12

Neelabh94 marked this pull request as ready for review October 24, 2025 12:55

gemini-code-assist bot reviewed Oct 28, 2025

View reviewed changes

Fix: Apply suggestions from gemini-code-assist

8691e19

bytetwin approved these changes Oct 30, 2025

View reviewed changes

examples/gke-managed-hyperdisk.yaml Show resolved Hide resolved

Neelabh94 merged commit 3405b56 into GoogleCloudPlatform:develop Oct 30, 2025
13 of 66 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs #4683

fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs #4683

Uh oh!

Neelabh94 commented Sep 22, 2025 •

edited

Loading

Uh oh!

bytetwin left a comment

Uh oh!

Neelabh94 commented Oct 7, 2025

Uh oh!

bytetwin left a comment

Uh oh!

Neelabh94 commented Oct 8, 2025

Uh oh!

Neelabh94 commented Oct 28, 2025

Uh oh!

gemini-code-assist bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs #4683

fix(gke-hyperdisk-test): Replace the tensor flow jobs with FIO jobs #4683

Uh oh!

Conversation

Neelabh94 commented Sep 22, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Submission Checklist

Uh oh!

bytetwin left a comment

Choose a reason for hiding this comment

Uh oh!

Neelabh94 commented Oct 7, 2025

Uh oh!

bytetwin left a comment

Choose a reason for hiding this comment

Uh oh!

Neelabh94 commented Oct 8, 2025

Uh oh!

Neelabh94 commented Oct 28, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Neelabh94 commented Sep 22, 2025 •

edited

Loading